[SYSTEMDS-3524] Multi-threading of transformdecode/[SYSTEMDS-3521] Improved Feature Transformations#2275
Isso-W wants to merge 42 commits into apache:main
…e done, test passes for Bin
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2275 +/- ##
============================================
- Coverage 72.58% 72.55% -0.04%
- Complexity 46221 46275 +54
============================================
Files 1489 1496 +7
Lines 174193 174561 +368
Branches 34182 34232 +50
============================================
+ Hits 126434 126646 +212
- Misses 38196 38347 +151
- Partials 9563 9568 +5
Thank you for the patch. I will take a look at this next week, @Isso-W.
small error in test array
# Conflicts: # src/test/java/org/apache/sysds/test/functions/transform/ColumnDecoderMixedMethodsTest.java
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.*;
Avoid wildcard imports. Import only the required classes.
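For illustration, a minimal sketch of the explicit-import style this comment asks for. The class `ExplicitImportsSketch` and the choice of `List` and `ArrayList` are hypothetical; the thread does not show which `java.util` classes the file actually needs.

```java
// Explicit imports instead of `import java.util.*;`.
// Which java.util classes the real file uses is an assumption here.
import java.util.ArrayList;
import java.util.List;

final class ExplicitImportsSketch {
	// Collects 1-based column IDs into a list, using only the imported types.
	static List<Integer> toList(int[] colList) {
		List<Integer> ids = new ArrayList<>(colList.length);
		for (int colID : colList)
			ids.add(colID);
		return ids;
	}

	public static void main(String[] args) {
		System.out.println(toList(new int[] {1, 3, 5}));
	}
}
```

With explicit imports, a reader can see at the top of the file exactly which library types the class depends on.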
for( int j=0; j<_colList.length; j++ ) {
	int colID = _colList[j];
	double val = UtilFunctions.objectToDouble(
		out.getSchema()[colID-1], out.get(i, colID-1));
	long key = UtilFunctions.toLong(val);
	out.set(i, colID-1, getRcMapValue(j, key));
}
Why are you iterating over all the columns? A column decoder should be called for each column.
if( _onOut ) { //recode on output (after dummy)
	for( int i=rl; i<ru; i++ ) {
		for( int j=0; j<_colList.length; j++ ) {
			int colID = _colList[j];
			double val = UtilFunctions.objectToDouble(
				out.getSchema()[colID-1], out.get(i, colID-1));


Remove these empty lines that you added.
protected int[] _colList;
protected String[] _colnames = null;
protected ColumnDecoder(ValueType[] schema, int[] colList) {
	_schema = schema;
	_colList = colList;
}
Why a column list? A column encoder should work on a single column.
long b1 = System.nanoTime();
out.ensureAllocatedColumns(in.getNumRows());

final int outColIndex = _colList[0] - 1;
Why is outColIndex always _colList[0] - 1?
for (int j = 0; j < _colList.length; j++) {
	double val = in.get(i, j);
	if (!Double.isNaN(val)) {
		int key = (int) Math.round(val);
		double bmin = _binMins[j][key - 1];
		double bmax = _binMaxs[j][key - 1];
		double oval = bmin + (bmax - bmin) / 2 + (val - key) * (bmax - bmin);
		out.getColumn(_colList[j] - 1).set(i, oval);
	} else {
		out.getColumn(_colList[j] - 1).set(i, val);
	}
}
I don't understand why you are iterating over all columns in a column decoder.
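As a rough illustration of the single-column shape requested here, the bin arithmetic from the excerpt above can be applied to one column at a time. This is only a sketch: the class `SingleColumnBinDecodeSketch`, its method names, and the use of plain `double[]` arrays in place of the SystemDS block types are assumptions, not the PR's code.

```java
final class SingleColumnBinDecodeSketch {
	// Decodes one encoded value of a single binned column using the same
	// formula as the excerpt: bin midpoint plus a fractional offset.
	static double decodeBin(double val, double[] binMins, double[] binMaxs) {
		if (Double.isNaN(val))
			return val; // missing values pass through unchanged
		int key = (int) Math.round(val); // 1-based bin ID
		double bmin = binMins[key - 1];
		double bmax = binMaxs[key - 1];
		return bmin + (bmax - bmin) / 2 + (val - key) * (bmax - bmin);
	}

	// Decodes rows [rl, ru) of one encoded column; no loop over other columns.
	static void decodeColumn(double[] in, double[] out, int rl, int ru,
			double[] binMins, double[] binMaxs) {
		for (int i = rl; i < ru; i++)
			out[i] = decodeBin(in[i], binMins, binMaxs);
	}

	public static void main(String[] args) {
		double[] mins = {0, 10, 20};
		double[] maxs = {10, 20, 30};
		double[] out = new double[3];
		decodeColumn(new double[] {1, 2, 3}, out, 0, 3, mins, maxs);
		System.out.println(java.util.Arrays.toString(out));
	}
}
```

For an integer bin ID the offset term is zero, so bin 2 with bounds 10 and 20 decodes to the midpoint 15.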
for( int j=0; j<_colList.length; j++ )
	for( int k=_clPos[j]; k<_cuPos[j]; k++ )
		if( in.get(i, k-1) != 0 ) {
			int col = _colList[j] - 1;
			Object val = UtilFunctions.doubleToObject(out.getSchema()[col], k-_clPos[j]+1);
			synchronized(out) { out.set(i, col, val); }
		}
}
A column decoder should work on a single column that is provided.
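One way to picture this: if each decoder owns exactly one output column, the decoders can run as independent tasks and each writes only its own column. The sketch below uses plain arrays and hypothetical names (`PerColumnDecodeSketch`, `DummycodeColumn`, `decodeAll`). Whether the SystemDS frame type allows unsynchronized writes to distinct columns is not stated in this thread, so treat that part as an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerColumnDecodeSketch {
	// Hypothetical single-column dummycode decoder.
	static final class DummycodeColumn {
		final int outCol; // 0-based output column, assumed < number of decoders
		final int clPos;  // first encoded position (0-based, inclusive)
		final int cuPos;  // end of encoded positions (0-based, exclusive)

		DummycodeColumn(int outCol, int clPos, int cuPos) {
			this.outCol = outCol;
			this.clPos = clPos;
			this.cuPos = cuPos;
		}

		// Writes only out[outCol], so two decoders never touch the same array.
		void decode(double[][] in, int[][] out) {
			for (int i = 0; i < in.length; i++)
				for (int k = clPos; k < cuPos; k++)
					if (in[i][k] != 0)
						out[outCol][i] = k - clPos + 1; // 1-based category code
		}
	}

	// Runs one task per column decoder and waits for all of them.
	static int[][] decodeAll(double[][] in, List<DummycodeColumn> decoders, int numThreads) {
		int[][] out = new int[decoders.size()][in.length];
		ExecutorService pool = Executors.newFixedThreadPool(numThreads);
		try {
			List<Callable<Void>> tasks = new ArrayList<>();
			for (DummycodeColumn d : decoders)
				tasks.add(() -> { d.decode(in, out); return null; });
			for (Future<Void> f : pool.invokeAll(tasks))
				f.get(); // surfaces any exception thrown inside a task
		} catch (InterruptedException | ExecutionException e) {
			throw new IllegalStateException("column decode failed", e);
		} finally {
			pool.shutdown();
		}
		return out;
	}
}
```

Here the result is stored column-wise (`out[column][row]`), which keeps each task's writes in a separate array.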
@Isso-W, please address the comments and fix the tests as well. Your tests are failing in the transform package, which you should be able to reproduce.
…erly on Bin and Pass-through.
# Conflicts: # src/test/java/org/apache/sysds/test/functions/transform/ColumnDecoderMixedMethodsTest.java
The latest changes look good, @Isso-W.
@Isso-W, can you please post your plots of FTBench (decoding) here for others to see?
Sure @phaniarnab, here is the plot.
I also created a new test that runs part of FTBench T9 through DML. I can add that test and the corresponding JSON to the PR to make the plot reproducible. By the way, will this PR be merged into main?
….csv, flight.csv is too large to push



This pull request introduces a new framework for column decoding in Apache SystemDS, with the addition of a base class ColumnDecoder and several specialized implementations (ColumnDecoderBin, ColumnDecoderComposite, ColumnDecoderRecode, ColumnDecoderPassthrough and ColumnDecoderDummycode). These changes provide a flexible and extensible structure for decoding encoded data in matrix-to-frame transformations. Below are the most important changes grouped by theme:

Core Framework for Column Decoding
ColumnDecoder serves as an abstract base class that defines the structure for decoding operations, including methods for decoding (columnDecode), handling sub-range decoding (subRangeDecoder), and metadata initialization (initMetaData). It also implements Externalizable for efficient serialization.

Current Issues
ColumnDecoderDummycode is not supported yet, and neither is the test case ColumnDecoderMixedMethodsTest.
DecoderDummyCode does not work together with DecoderRecode.
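
To make the base-class description under "Core Framework for Column Decoding" more concrete, here is a rough sketch of what such a class could look like. Only the method names (columnDecode, subRangeDecoder, initMetaData) and the use of Externalizable come from the description. The class names `ColumnDecoderSketch` and `PassthroughSketch`, all signatures, and the plain-array data types are assumptions for illustration.

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Hypothetical shape of a single-column decoder base class.
abstract class ColumnDecoderSketch implements Externalizable {
	protected int colID; // 1-based column this decoder is responsible for

	public ColumnDecoderSketch() {} // public no-arg constructor, required by Externalizable

	protected ColumnDecoderSketch(int colID) {
		this.colID = colID;
	}

	// Decodes rows [rl, ru) of the encoded column into the output column.
	abstract void columnDecode(double[] in, Object[] out, int rl, int ru);

	// Returns this decoder if its column lies in [colStart, colEnd), else null.
	ColumnDecoderSketch subRangeDecoder(int colStart, int colEnd) {
		return (colID >= colStart && colID < colEnd) ? this : null;
	}

	// Initializes decoder state from per-column metadata (no-op by default).
	void initMetaData(String[] meta) {}

	@Override
	public void writeExternal(ObjectOutput os) throws IOException {
		os.writeInt(colID);
	}

	@Override
	public void readExternal(ObjectInput is) throws IOException {
		colID = is.readInt();
	}
}

// Pass-through decoder: copies encoded values unchanged.
final class PassthroughSketch extends ColumnDecoderSketch {
	public PassthroughSketch() {}

	PassthroughSketch(int colID) {
		super(colID);
	}

	@Override
	void columnDecode(double[] in, Object[] out, int rl, int ru) {
		for (int i = rl; i < ru; i++)
			out[i] = in[i]; // autoboxed to Double
	}
}
```

A subclass such as a bin or recode decoder would override columnDecode and initMetaData, and extend the two Externalizable methods with its own state.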